We recall, we were looking to see if there was systematic discrimination in Chicago in the 1970s with insurance companies denying home insurance to minority residents.
We have data that is difficult to analyze due to the ecological fallacy:
However, we don't know who was accepted or denied in the normal insurance market
We wish to disentagle the effect of percent minority residents in a neighborhood from other factors as well. Specifically:
theft rates;
fire rates;
the age of the neighborhood; and
the median income of the neighborhood.
library(faraway)
summary(chredlin)
race fire theft age
Min. : 1.00 Min. : 2.00 Min. : 3.00 Min. : 2.00
1st Qu.: 3.75 1st Qu.: 5.65 1st Qu.: 22.00 1st Qu.:48.60
Median :24.50 Median :10.40 Median : 29.00 Median :65.00
Mean :34.99 Mean :12.28 Mean : 32.36 Mean :60.33
3rd Qu.:57.65 3rd Qu.:16.05 3rd Qu.: 38.00 3rd Qu.:77.30
Max. :99.70 Max. :39.70 Max. :147.00 Max. :90.10
involact income side
Min. :0.0000 Min. : 5.583 n:25
1st Qu.:0.0000 1st Qu.: 8.447 s:22
Median :0.4000 Median :10.694
Mean :0.6149 Mean :10.696
3rd Qu.:0.9000 3rd Qu.:11.989
Max. :2.2000 Max. :21.480
lmod <- lm(involact ~ race + fire + theft + age + log(income), data=chredlin)
This was chosen based upon criterion methods, and some qualitative selection of the log-income variable.
Specifically, we noted that log-income and percent minority in a neighborhood of Chicago is highly anti-correlated;
In this case, we found that the race parameter was significant, but to be careful about our conclusions and explanations, we performed the usual diagnostics to test the goodness of the model.
We note that there is a structural issue arising due to the high number of zero FAIR plans (the response variable), but that there isn't a realistic choice of scale transformation to fix this.
Normality and variance assumptions of the error are otherwise OK.
We will probably need to evaluate the effect of points of high leverage in the analysis.
We wish thus to study the sensitivity of the main parameter of interest “race”, with respect to the leverage points and controlling for covariates in the analysis.
We will start with looking at the partial residuals for race and fire to see if, when controlling for the other factors, we have a good linear fit.
Likewise, we will look at the confidence intervals for these parameters, to judge some of the uncertainty in the estimate.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=1, cex=3, cex.lab=3, cex.axis=1.5)
coefficients(lmod)[2]
race
0.009502223
confint(lmod, parm = "race")
2.5 % 97.5 %
race 0.004474458 0.01452999
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
termplot(lmod, partial.resid=TRUE, terms=2, cex=3, cex.lab=3, cex.axis=1.5)
coefficients(lmod)[3]
fire
0.03985604
confint(lmod, parm = "fire")
2.5 % 97.5 %
fire 0.02215246 0.05755963
For each of the most significant parameters, there isn't a great deal of uncertainty in the effect on the response, and the structure appears to be fine in the partial residuals.
Loosely speaking, it appears that there is a small, but statistically significant, effect in which neighborhoods with higher concentrations of ethnic minorities have a higher number of FAIR plans, when all other variables are held equal.
This effect is not nearly as strong as e.g., the rate of fires with all other variables held equal.
Our analysis is not even nearly complete, and we cannot say if any application for insurance was rejected based upon their ethnic background;
Generally, we should also perform the partial residuals for the other variables, but these will be supressed here for brevity.
In terms of analyzing the sensitivity of the race parameter with respect to the inclusion or exclusion of other covariates, we can automate this to determine if the effect on the response changes dramatically.
This is similar to the case when we used matching in the voting districts in New Hampshire, where we needed to see if an additional covariate was acting as a confounding variable on our analysis.
By analyzing the inter-model stability of this parameter, we have a better sense of if this explanation is actually robust.
beta pvalue
race 0.0139 0.0000
race + fire 0.0089 0.0002
race + theft 0.0141 0.0000
race + age 0.0123 0.0000
race + log ( income ) 0.0082 0.0087
race + fire + theft 0.0082 0.0002
race + fire + age 0.0089 0.0001
race + fire + log ( income ) 0.0070 0.0160
race + theft + age 0.0128 0.0000
race + theft + log ( income ) 0.0084 0.0083
race + age + log ( income ) 0.0099 0.0017
race + fire + theft + age 0.0081 0.0001
race + fire + theft + log ( income ) 0.0073 0.0078
race + fire + age + log ( income ) 0.0085 0.0041
race + theft + age + log ( income ) 0.0106 0.0010
race + fire + theft + age + log ( income ) 0.0095 0.0004
The above output shows the parameter for race and the associated p-values for all 16 models.
There is some variance in the magnitude of the effect, depending on the other variables we control for, but in no case does the p-value rise above \( 5\% \) or does the parameter change sign.
This suggests some uncertainty over the magnitude of the effect, we can be sure that the significance of the effect is not sensitive to the choice of adjusters.
If we suppose that we were able to find models in which the composition of the neighborhood in terms of minorities was not statistically significant, our ability to control for other factors would be more difficult.
We would then need to consider more deeply which covariates should be adjusted for and which not.
This level of model analysis and selection is at the heart of many real world problems, and we are fortunate as modellers in this case that it is unnecessary.
In general, however, this is the reality of real problems you will encounter after this class.
Now, we want to analyze the effect of leverage and influence on the model and our interpretation.
We recall that there are several influential points, and we wish to determine if their inclusion or exclusion would radically change the interpretation for the parameter for race.
par(mai=c(1.5,1.5,.5,.5), mgp=c(3,0,0))
plot(dfbeta(lmod)[,2],ylab="Change in race coef", cex=3, cex.lab=3, cex.axis=1.5)
abline(h=0)
diags <- data.frame(lm.influence(lmod)$coef)
library(ggplot2)
ggplot(diags,aes(row.names(diags), race)) + geom_linerange(aes(ymax=0, ymin=race)) + theme(axis.text.x=element_text(angle=90), axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold")) + xlab("ZIP") + geom_hline(yintercept = 0)
ggplot(diags,aes(x=fire,y=theft))+geom_text(label=row.names(diags)) +
theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
chredlin[c("60607","60610"),]
race fire theft age involact income side
60607 50.2 39.7 147 83.0 0.9 7.459 n
60610 54.0 34.1 68 52.6 0.3 8.231 n
lmode <- lm(involact ~ race + fire + theft + age + log(income), chredlin,subset=-c(6,24))
sumary(lmode)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.5767365 1.0800462 -0.5340 0.59638
race 0.0070527 0.0026960 2.6160 0.01259
fire 0.0496474 0.0085701 5.7931 1.004e-06
theft -0.0064336 0.0043489 -1.4794 0.14708
age 0.0051709 0.0028947 1.7863 0.08182
log(income) 0.1157028 0.4011132 0.2885 0.77453
n = 45, p = 6, Residual SE = 0.30320, R-Squared = 0.8
We have verified that our conclusions are also robust to the exclusion of one or perhaps two cases from the data.
In any case, it would be very important to disclose this choice in the analysis, and any suppression of the information would be extremely dishonest.
x <- model.matrix(lmod)
x0 <- apply(x,2,median)
x0 <- data.frame(t(x0))
# note here that R won't accept the variable that we've rescaled until we re-name it back to its original name...
colnames(x0)[6] <- "income"
predict(lmod,new=x0,interval="prediction")
fit lwr upr
1 0.003348934 -1.398822 1.40552
predict(lmod,new=x0,interval="confidence")
fit lwr upr
1 0.003348934 -1.225333 1.232031
We have shown very robust evidence for the statistically significant effect of the neighborhood's ethnic composition, with all other variables held equal, having a significant effect in determining the number of FAIR plans in the neighborhood.
This doesn't mean conclusively, however, that there was systematic discrimination.
If we tried another hypothetical model, supressing theft and the two high influence observations:
modalt <- lm(involact ~ race+fire+log(income),chredlin, subset=-c(6,24))
sumary(modalt)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7532557 0.8358798 0.9012 0.37277
race 0.0042061 0.0022757 1.8483 0.07178
fire 0.0510220 0.0084504 6.0378 3.822e-07
log(income) -0.3623825 0.3191620 -1.1354 0.26279
n = 45, p = 4, Residual SE = 0.30919, R-Squared = 0.79
the parameter for race is no longer significant.
This is just a hypothetical example, because our relatively exhaustive analysis has shown that this is a “cherry-picked” model, versus the other possible models.
Generally, however, this shows some of the subtlety in statistical analysis like this
Part of robust analysis is thus to perform several model selections and the associated diagnostics to deal with the overall uncertainty in the model selection;
The strength of our conclusions will increase if we can show that several different models all have similar interpretations.
If there are several reasonable models with different interpretations or conclusions, there is a fundamental issue in the data, and we should be cautious and be skeptical if there is really a signal to be found in the data.
Generally, we have performed rigorous analysis in the above (pending completion of additional analysis).
However, there are still several ways we should be cautious about our conclusions:
lmod <- lm(involact ~ race+fire+theft+age+log(income), subset=(side == "s"),chredlin)
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.9946613 1.6949252 -1.1768 0.256470
race 0.0093509 0.0046055 2.0304 0.059285
fire 0.0510812 0.0170314 2.9992 0.008493
theft -0.0102565 0.0090972 -1.1274 0.276184
age 0.0067778 0.0053061 1.2774 0.219700
log(income) 0.6810543 0.6493336 1.0489 0.309832
n = 22, p = 6, Residual SE = 0.34981, R-Squared = 0.76
lmod <- lm(involact ~ race+fire+theft+age+log(income), subset=(side == "n"),chredlin)
sumary(lmod)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.7902939 1.7793768 -0.4441 0.66196
race 0.0133200 0.0053903 2.4711 0.02310
fire 0.0246654 0.0154219 1.5994 0.12623
theft -0.0079383 0.0039815 -1.9938 0.06073
age 0.0090040 0.0046471 1.9375 0.06769
log(income) 0.1666853 0.6233641 0.2674 0.79204
n = 25, p = 6, Residual SE = 0.35115, R-Squared = 0.76
Generally, sub-dividing data into smaller and smaller groups, we can dillute the significance.
However, as noted before, aggregating data too much causes its own issues – the balance between these is contextual and subjective based on our understanding of the task at hand.
ggplot(chredlin,aes(side,race)) + geom_point(position = position_jitter(width = .2,height=0)) + theme(axis.text=element_text(size=20), axis.title=element_text(size=20,face="bold"))
summary(chredlin$race[chredlin$side=="n"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 1.80 10.00 21.95 24.50 94.40
summary(chredlin$race[chredlin$side=="s"])
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 34.27 48.10 49.80 72.92 99.70
We performed initial, exploratory analysis, checking for structure in the data, and in the plots of variables versus the response.
We fit a model based on a mix of criterion methods and some qualitative judgement.
We performed diagnostics to see how our assumptions might be breaking down – aside from an unavoidable issue with the many zero observations for the FAIR plan, it was a well functioning model.
We evaluated the partial residuals, and confidence intervals, for the main parameter of interest – if this was more complete, we would do this for each of them.
As this is an explanatory model, we were interested in “predicting” the parameter for “race”;
We also demonstrated robustness to the exclusion of observations, and checked formally for outliers that could affect the fit.
Just for fun, we saw that it has awful predictive power…
We made a final assessment of the uncertainty that we could not account for with our data.
After all this analysis, everything we say needs to be hedged with “ifs” and “buts” and qualified carefully.
This is the reality of statistical modeling, where causual conclusions are extremely difficult to draw.
To quote Farway, quoting Winston Churchill:
“Indeed, it has been said that Democracy is the worst form of government, except all those other forms that have been tried from time to time.”
We might say the same about statistics with respect to how it helps us reason in the face of uncertainty.